Resource based virtual computing instance scheduling

ABSTRACT

Examples provide two-tiered scheduling within a cluster. A coarse-grained analysis is performed on a candidate set of hosts to select a host for a virtual computing instance based on optimization of at least one resource. A host is selected based on the analysis results. The identified virtual computing instance is placed on the selected host. A fine-grained analysis is performed on a set of communication graphs for a plurality of virtual computing instances to generate a set of penalty scores. A set of communicating virtual computing instances is selected based on the set of penalty scores. A first virtual computing instance from a first host is relocated to a second host to minimize a distance between the first virtual computing instance and a second virtual computing instance. Relocating the first virtual computing instance reduces at least one penalty score for the set of communicating virtual computing instances.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a divisional of U.S. patent application Ser. No. 15/283,274, filed Sep. 30, 2016 and entitled RESOURCE BASED VIRTUAL COMPUTING INSTANCE SCHEDULING, which is hereby incorporated by reference herein in its entirety.

BACKGROUND

A cluster is a collection of hosts in which processor, memory, storage, and other hardware resources are aggregated for utilization by the hosts in the cluster. A host is capable of running one or more virtual computing instances, such as virtual machines (VMs). A VM typically includes an operating system (OS) running one or more applications to perform a workload. VMs running on a host utilize cluster resources to perform the workloads. However, if a VM is placed on a host with insufficient resources available to meet the resource demands of the VMs, the host becomes overloaded.

In some existing solutions, one or more VMs on an overloaded host may be relocated to a different host in the cluster in an attempt to remediate the overloaded host. A scheduler is utilized in some systems to select a host for placement of VMs and balance the resource utilization among the hosts in the cluster. However, these placement and relocation decisions are frequently made based on insufficient information regarding resource demands of the VMs and resource availability of the hosts. This frequently results in sub-optimal placement of VMs, unbalanced hosts, network saturation, overloading of network links, and/or overall inefficient utilization of available cluster resources.

SUMMARY

Examples of the disclosure provide a two-tiered scheduler. A selection component selects a candidate set of hosts from a plurality of hosts within a cluster. The plurality of hosts is associated with a set of virtual computing instances. A coarse-grained scheduler component performs a coarse-grained optimization on the candidate set of hosts to select a host for an identified virtual computing instance based on at least one resource. The identified virtual computing instance is placed on the selected host. A fine-grained scheduler component relocates a first virtual computing instance in a set of communicating virtual computing instances from a first host in the cluster to a second host in the cluster based on at least one penalty score associated with the set of communicating virtual computing instances. Relocating the first virtual computing instance to the second host reduces at least one penalty score.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an exemplary block diagram illustrating a system for a two-tiered scheduler.

FIG. 2 is an exemplary block diagram illustrating a host computing device.

FIG. 3 is an exemplary block diagram illustrating a two-tiered scheduler.

FIG. 4 is an exemplary block diagram illustrating a coarse-grained scheduler.

FIG. 5 is an exemplary block diagram illustrating a fine-grained scheduler.

FIG. 6 is an exemplary flow chart illustrating operation of a two-tiered scheduler.

FIG. 7 is an exemplary flow chart illustrating operation of a coarse-grained scheduler.

FIG. 8 is an exemplary flow chart illustrating operation of a fine-grained scheduler.

FIG. 9 is an exemplary graph illustrating system utilization of different scheduling algorithms under different system sizes.

FIG. 10 is an exemplary graph illustrating system imbalance of different scheduling algorithms under different system sizes.

FIG. 11 is an exemplary graph illustrating algorithm runtime of different scheduling algorithms under different system sizes.

FIG. 12 is an exemplary graph illustrating algorithm runtime of a multiqueue-K algorithm using different top K candidates.

FIG. 13 is an exemplary graph illustrating system utilization of multiqueue-K algorithms using different top K candidates.

FIG. 14 is an exemplary graph illustrating reduction of communication cost of different scheduling algorithms over a random scheduler.

FIG. 15 is a block diagram of an example host computing device.

FIG. 16 is a block diagram of VMs instantiated on a host computing device.

Corresponding reference characters indicate corresponding parts throughout the drawings.

DETAILED DESCRIPTION

Referring to the figures, examples of the disclosure enable a coarse-grained scheduler and a fine-grained scheduler for network aware distributed resource scheduling. In some examples, a selection component selects a predetermined number of hosts from each queue in a plurality of resource-based host queues to create a candidate set of hosts. A powering-on virtual computing instance or a migrated virtual computing instance is placed on a host selected from the candidate set of hosts. Selecting a host from the candidate set rather than selecting the host from all hosts in a cluster increases host selection speed, reduces processor load, and conserves memory.

In other examples, the candidate set of hosts is selected from a plurality of hosts based on sampling from priority queues. This enables a scheduler that is scalable in both virtual computing instance and host dimensions because it uses sampling instead of an exhaustive search to find a suitable host for an incoming or resource-wise distressed virtual computing instance. This sampling further enables schedulers to operate in larger-scale environments.

A coarse-grained scheduler selects a host from the candidate set of hosts based on a resource-based optimization, including networking resource utilization and networking requirements of the virtual computing instances. Providing a scheduler that is capable of considering multiple resource metrics, including networking metrics, enables more accurate and reliable selection of hosts to reduce processor load, prevent host network saturation, improve cluster resource utilization, and enable more efficient placement of virtual computing instances on hosts in a cluster.

The coarse-grained scheduler performs a coarse-grained, infrastructure-level optimization to select a host for an identified virtual computing instance. This enables the scheduler to provide an improved distribution of load, increase virtual computing instance packing density on hosts, and minimize virtual computing instance rejection by the scheduler.

Aspects of the disclosure also contemplate a fine-grained scheduler that co-locates a pair of communicating virtual computing instances to minimize a distance between the pair of communicating virtual computing instances, such as two or more virtual machines (VMs). A communicating virtual computing instance is a virtual computing instance that is engaging in communications with another virtual computing instance on a same host or communicating with another virtual computing instance on a different host. Virtual computing instances utilize network resources to enable communications between the different virtual computing instances.

In some examples, the pair of VMs are co-located by placing the pair of VMs on the same host or placing the pair of VMs on two different hosts within a same rack of a rack scale architecture (RSA). Reducing the distance between communicating VMs improves communication speed between VMs, reduces network bandwidth usage, and improves network resource efficiency.

While some embodiments are described with reference to VMs for clarity of description, the disclosure is operable with other virtual computing instances. A virtual computing instance is a VM, a container, or any other type of virtualized computing instance. A host supports a VM, a container, and/or any other virtual computing instance. A host may be implemented as a physical server, or as a cloud instance running within a cloud environment. A cloud instance of a host is a host running within a VM, a container, or another virtual computing instance. This may be implemented as a first hypervisor running within a VM, which is running over a second hypervisor, in some examples. A cloud instance of a host runs within a virtual computing instance, while supporting one or more other computing instances. A VM running within a cloud instance of a host may be referred to as a nested VM.

Referring to FIG. 1, an exemplary block diagram illustrates a system for a two-tiered scheduler. The system 100 in this non-limiting example optionally includes a cloud 102. The cloud 102 may be implemented as a private cloud, a public cloud, or a hybrid cloud. A hybrid cloud is a cloud that includes a public cloud and a private cloud. VMware's vCloud Hybrid Services (vCHS) is an example of a hybrid cloud implementation.

The cloud 102 in this example is a cloud computing platform. In some examples, the cloud 102 runs one or more virtual computing instances, such as, but not limited to, VMs in a set of VMs 104. Cloud services associated with the cloud 102 are provided to clients via a network 106. The network 106, in some examples, is a Wide Area Network (WAN) accessible to the public, such as the Internet. The cloud services are provided via one or more physical servers, such as a set of servers 112 associated with data center 110.

In this example, the data center 110 includes one or more physical computing devices in the set of servers 112 and/or data storage device(s) 118. The set of servers 112 may include a single server, as well as two or more servers in a cluster 116. The cluster 116 is a group of two or more physical server devices. In some examples, the cluster 116 is implemented as a VMware vSphere cluster.

In some examples, the set of servers 112 includes an RSA housing a plurality of physical servers. In yet other examples, the set of servers 112 includes one or more blade servers.

The set of servers 112 in this non-limiting example hosts a set of VMs 114. The set of VMs 114 includes one or more VMs running on one or more servers.

The data storage device(s) 118 in this non-limiting example include one or more devices for storing data. The data storage device(s) 118 may be implemented as any type of data storage, including, but not limited to, a hard disk, an optical disk, a redundant array of independent disks (RAID), a solid state drive (SSD), a flash memory drive, a storage area network (SAN), or any other type of data storage device. The data storage device(s) 118 may include rotational storage, such as a disk. The data storage device(s) 118 may also include non-rotational storage media, such as SSD or flash memory.

In some non-limiting examples, the data storage device(s) 118 provide a shared data store. The shared data store is data storage accessible by two or more hosts in the cluster 116.

In some examples, the system 100 optionally includes a remote data storage device, such as data storage device 120. The remote data storage device 120 is accessible by the set of servers 112 via the network 106.

The scheduler 108, in this non-limiting example, is a two-tiered, network aware distributed resource scheduler including a coarse-grained scheduler and a fine-grained scheduler. In other examples, the scheduler 108 only includes the coarse-grained scheduler without the fine-grained scheduler. In still other examples, the scheduler 108 includes only a fine-grained scheduler without the coarse-grained scheduler.

The scheduler 108, in this example, runs on one or more computing devices associated with the data center 110, such as a server in the set of servers 112. In other examples, the scheduler 108 optionally executes in the cloud 102.

The set of VMs 114 in the cluster 116 may have highly diverse resource requirements along central processing unit (CPU), memory, and input/output (I/O) dimensions. Prior art schedulers typically handle VM placement and load balancing for CPU, memory, and storage. However, these schedulers typically do not consider either the VMs' or the hosts' networking behavior when performing VM placement or load-balancing. This frequently results in sub-optimal VM placements and relocations, causing host network saturation and overloading of network links at the core/aggregation level. In contrast, the two-tiered scheduler 108 performs scalable VM placement with multiple resource types and performs VM relocation to relieve resource contention as well as co-locate chatty virtual machines.

Moreover, elastic resource provisioning in a software defined datacenter (SDDC) is frequently managed by a number of different schedulers managing different resources independently of one another. For example, in some systems, compute resources, such as CPU and memory, are managed by a resource scheduler, storage resources are managed by a storage scheduler, and network resources are managed by a network scheduler. These different schedulers operate independently from each other and frequently work on different sets of input metrics. The utilization of these disparate schedulers also results in sub-optimal VM placements and inefficient resource management in the datacenter.

A distributed resource scheduler (DRS) is a prior art scheduler for managing resources in a cluster, such as CPU, memory, and storage. In some examples, the primary metric the scheduler optimizes is dynamic entitlement. This metric reflects resource delivery in accordance with both the needs and importance of the VMs and is a function of the VM's actual resource demands, overall cluster capacity, and the VM's resource settings. The VM's resource settings may include reservations, limits, and shares. A reservation is a claim or guarantee on a specific amount of a resource should the VM demand it. A VM's entitlement for a resource is higher than its reservation and lower than its limit. Dynamic entitlement is equal to VM demand if there are sufficient resources in the cluster to meet all VM demands. Otherwise, it is scaled down based on cluster capacity, the demands of other VMs, and the VM's settings for reservations, shares, and limits.
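
The clamping behavior described above can be illustrated with a minimal sketch. The function name, the single scale factor, and the omission of share-based weighting are simplifying assumptions for illustration only, not the actual DRS implementation:

    def dynamic_entitlement(demand, reservation, limit, scale_factor=1.0):
        # Entitlement tracks demand, scaled down under contention
        # (scale_factor < 1.0), but never drops below the reservation
        # or rises above the limit.
        scaled = demand * scale_factor
        return max(reservation, min(scaled, limit))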

A DRS scheduler typically computes host load (its normalized entitlement) by summing up the entitlements of the VMs running on it and normalizing the sum using the host's capacity. This normalized entitlement is then used to calculate the cluster balance metric, which is the standard deviation of the normalized entitlements of the hosts. The primary target of the optimization algorithm is to move the standard deviation value close to zero when making placement decisions or load-balancing.
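
A minimal sketch of this balance metric, assuming entitlements and capacities are given as plain numbers (the helper names are hypothetical):

    from statistics import pstdev

    def normalized_entitlement(vm_entitlements, host_capacity):
        # Host load: sum of the entitlements of the VMs running on the
        # host, normalized by the host's capacity.
        return sum(vm_entitlements) / host_capacity

    def cluster_imbalance(hosts):
        # hosts: list of (vm_entitlements, capacity) pairs. The balance
        # metric is the standard deviation of the normalized
        # entitlements; the optimizer drives this value toward zero.
        loads = [normalized_entitlement(vms, cap) for vms, cap in hosts]
        return pstdev(loads)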

However, the DRS scheduler does not fully factor in networking resources when making VM-placement decisions. Support for reservations on a VM's outbound bandwidth, such as transmit bandwidth, allows DRS to perform an admission control check to ensure that the sum of network reservations on a host does not exceed its capacity. However, DRS does not consider actual usage of a host's network interface controllers (NICs).

In contrast with these prior art schedulers, the scheduler 108 in the present disclosure manages compute, storage, memory, and network resources together. The scheduler does not consider network resource usage independently from CPU usage. Even with support for hardware offloading, processor cycles are needed to drive traffic. The positive correlation between compute and networking places networking as a secondary, dependent resource rather than a primary, independent resource.

Networking resources have on-host and off-host components. The on-host components may include the physical network interface controller (NIC). The off-host components include the switches and racks.

FIG. 2 is a block diagram of a host computing device for serving one or more VMs. The illustrated host computing device 200 may be implemented as any type of host computing device, such as a server. In some non-limiting examples, the host computing device 200 is implemented as a host or ESXi host from VMware, Inc. The host computing device 200 is a host for running one or more VMs.

The host computing device 200 represents any device executing instructions (e.g., as application(s), operating system, operating system functionality, or both) to implement the operations and functionality associated with the host computing device 200. The host computing device 200 may include desktop personal computers, kiosks, tabletop devices, industrial control devices, or servers, such as, but not limited to, a server in the set of servers 112 in FIG. 1. In some examples, the host computing device 200 is implemented as a blade server within an RSA. Additionally, the host computing device 200 may represent a group of processing units or other computing devices.

The host computing device 200 includes a hardware platform 202. The hardware platform 202, in some examples, includes one or more processor(s) 204, a memory 206, and at least one user interface, such as user interface component 208.

The processor(s) 204 includes any quantity of processing units, and is programmed to execute computer-executable instructions for implementing the examples. The instructions may be performed by the processor or by multiple processors within the host computing device 200, or performed by a processor external to the host computing device 200. In some examples, the one or more processors are programmed to execute instructions such as those illustrated in the figures (e.g., FIG. 6, FIG. 7, and FIG. 8).

The host computing device 200 further has one or more computer readable media, such as the memory 206. The memory 206 includes any quantity of media associated with or accessible by the host computing device 200. The memory 206 may be internal to the host computing device 200, external to the host computing device, or both. In some examples, the memory 206 includes read-only memory (ROM) 210.

The memory 206 further stores a random access memory (RAM) 212. The RAM 212 may be any type of random access memory. In this example, the RAM 212 is part of a shared memory architecture. In some examples, the RAM 212 may optionally include one or more cache(s). The memory 206 further stores one or more computer-executable instructions 214.

The host computing device 200 may optionally include a user interface component 208. In some examples, the user interface component 208 includes a graphics card for displaying data to the user and receiving data from the user. The user interface component 208 may also include computer-executable instructions (e.g., a driver) for operating the graphics card. Further, the user interface component 208 may include a display (e.g., a touch screen display or natural user interface) and/or computer-executable instructions (e.g., a driver) for operating the display. The user interface component 208 may also include one or more of the following to provide data to the user or receive data from the user: speakers, a sound card, a camera, a microphone, a vibration motor, one or more accelerometers, a BLUETOOTH brand communication module, global positioning system (GPS) hardware, and a photoreceptive light sensor.

In some examples, the hardware platform 202 optionally includes a network communications interface component 216. The network communications interface component 216 includes a network interface card and/or computer-executable instructions (e.g., a driver) for operating the network interface card. Communication between the host computing device 200 and other devices may occur using any protocol or mechanism over any wired or wireless connection. In some examples, the communications interface is operable with short range communication technologies, such as by using near-field communication (NFC) tags.

The data storage device(s) 218 may be implemented as any type of data storage, including, but not limited to, a hard disk, an optical disk, a redundant array of independent disks (RAID), a solid state drive (SSD), a flash memory drive, a storage area network (SAN), or any other type of data storage device. The data storage device(s) 218 may include rotational storage, such as a disk. The data storage device(s) 218 may also include non-rotational storage media, such as SSD or flash memory.

In some non-limiting examples, the data storage device(s) 218 provide a shared data store. A shared data store is data storage accessible by two or more hosts in a host cluster.

The host computing device 200 hosts one or more virtual computing instances, such as, but not limited to, VMs 220 and 222. The VM 220, in some examples, includes data such as, but not limited to, one or more application(s) 224. The VM 222 in this example includes application(s) 226. The application(s), when executed by the processor(s) 204, operate to perform functionality on the host computing device 200.

Exemplary application(s) include, without limitation, mail application programs, web browsers, calendar application programs, address book application programs, messaging programs, media applications, location-based services, search programs, and the like. The application(s) may communicate with counterpart applications or services such as web services accessible via a network. For example, the applications may represent downloaded client-side applications that correspond to server-side services executing in a cloud.

In some examples, modern enterprise applications running in datacenter environments are distributed in nature and usually I/O intensive. Each component of such distributed applications is packed into an individual VM and deployed in clusters of physical machines, such as, but not limited to, VMware vSphere clusters. In these examples, each component has different resource demands. Each of the VMs running a component of a distributed application may also have highly diverse resource requirements. The two-tiered scheduler 108 shown in FIG. 1, in some examples, performs an infrastructure optimization such that the application(s) running inside one or more VMs are allotted the necessary resources to run.

Each VM includes a guest operating system (OS). In this example, VM 220 includes guest OS 228 and VM 222 includes guest OS 230.

The host computing device 200 further includes one or more computer-executable components. Exemplary components include a hypervisor 232. The hypervisor 232 is a VM monitor that creates and runs one or more VMs, such as, but not limited to, VM 220 or VM 222. In one example, the hypervisor 232 is implemented as a vSphere Hypervisor from VMware, Inc.

The host computing device 200 running the hypervisor 232 is a host machine. VM 220 is a guest machine. The hypervisor 232 presents the OS 228 of the VM 220 with a virtual hardware platform. The virtual hardware platform may include, without limitation, virtualized processor 234, memory 236, user interface device 238, and network communication interface 240. The virtual hardware platform, VM(s), and the hypervisor are illustrated and described in more detail in FIG. 16 below.

FIG. 3 is an exemplary block diagram illustrating a two-tiered scheduler. The scheduler 300 in this example is a two-tiered scheduler for placing virtual machines on hosts and relocating VMs to remediate resource contention as well as improve application metrics.

In this example, the scheduler 300 includes a coarse-grained scheduler 302 and a fine-grained scheduler 306. The coarse-grained scheduler 302 finds a suitable host for an incoming or distressed virtual computing instance to optimize for infrastructure.

The scheduler 302, in some examples, selects a host for a powering-on VM such that it can optimize for the demands of both the infrastructure provider and the infrastructure user. The coarse-grained scheduler 302 in some examples includes a resource based optimizer 304. In this example, the optimizer 304 includes a sampling based packing algorithm for performing a coarse-grained, resource-based optimization on a candidate set of hosts. The sampling based packing algorithm is implemented to find a host for a VM from a sample of hosts based on one or more resource metrics for optimization.

The scheduler 300 collects statistics from a cluster statistics collector 310. The cluster statistics received from the cluster statistics collector 310 include host resource capacity, VM resource demand, and VM resource usage. The host resource capacity data includes, without limitation, total CPU utilization, total consumed memory, and total network receive and transmit usage. The cluster statistics collector 310, in some examples, provides per-VM usage statistics, such as the VM resource demand and VM resource usage, to the scheduler.

Network traffic between hosts and VMs in a cluster is unstable. The network traffic frequently includes periods of high usage followed by low usage. Due to the peaks and valleys in network traffic, averaging network traffic usage for a VM or host is not always useful. Therefore, the VM network resource usage statistics may be provided using a percentile measure. In these examples, a percentile high-water mark may be used for stability in determining network usage. In one non-limiting example, the high-water mark is the seventy-fifth percentile. In other examples, a high-water mark of the eightieth percentile may be utilized.
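
The percentile high-water mark described above can be sketched as follows. This nearest-rank version assumes a non-empty list of sampled send/receive rates and is illustrative rather than the collector's actual method:

    import math

    def network_high_water_mark(samples, percentile=75):
        # Summarize bursty network usage with a percentile high-water
        # mark rather than a mean, which the peaks and valleys distort.
        ordered = sorted(samples)
        rank = max(1, math.ceil(len(ordered) * percentile / 100))
        return ordered[rank - 1]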

Moreover, in some examples, the cluster statistics include internal send and receive traffic occurring on a single host, as well as external send and receive traffic occurring across different hosts. The external network traffic is more expensive than the internal network traffic. These internal versus external communications traffic statistics are considered to avoid separating VMs which communicate at a high rate with one another on the same host.

The scheduler retrieves the statistics from the cluster statistics collector 310 to evaluate the cluster status as the VMs are powering-on. The scheduler 300 also receives basic topology information, rack boundary data, and link bandwidth data from the static configuration component 314.

In this example, the coarse-grained scheduler 302 finds a host for a powering-on or already running but resource-wise distressed VM. When finding such a host, there are many metrics that the scheduler can optimize. In some examples, the coarse-grained scheduler optimizes for higher packing density, minimal incoming VM rejections, speed of locating a suitable host, or load-balancing.

In some examples, the optimizer 304 performs a dot product algorithm on a candidate set of hosts to select the best host for placement of a particular VM. The optimizer 304 selects a host which fits the particular VM by defining alignment of a task relative to a machine across multiple dimensions. The dimensions are different resources. In the simplest case, where there is only one dimension, such as, but not limited to, a CPU resource, the optimizer 304 picks the largest task that fits a given host.

Extending the optimization to multiple dimensions, the larger the alignment, the lower the fragmentation of the resources. The coarse-grained scheduler 302 picks the best available host such that the VM's resource demand vector and the host's available resource vector are aligned. Maximizing the dot product between the VM's resource demand vector and the host's available resource vector gives the best packing efficiency for the cluster.
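
The dot product selection can be sketched as below. Resource vectors are assumed to be tuples in comparable, normalized units (e.g., CPU, memory, network), and the dictionary-based host representation is a hypothetical convenience, not the disclosed data structure:

    def best_host(vm_demand, hosts):
        # Keep only hosts with enough headroom in every dimension, then
        # pick the host whose available-resource vector best aligns
        # with the VM's demand vector (largest dot product).
        def fits(avail):
            return all(a >= d for a, d in zip(avail, vm_demand))

        def alignment(avail):
            return sum(a * d for a, d in zip(avail, vm_demand))

        candidates = [h for h in hosts if fits(h["available"])]
        if not candidates:
            return None  # no suitable host; the VM is rejected or deferred
        return max(candidates, key=lambda h: alignment(h["available"]))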

The scheduler places the identified VM on the selected host in a plurality of hosts 318 in the cluster. The plurality of hosts 318 includes a plurality of VMs 320 running on one or more of the hosts. The plurality of hosts 318 may be implemented as physical host computing devices. In other examples, a host in the plurality of hosts 318 is implemented as a hypervisor running one or more VMs.

The fine-grained scheduler 306 optimizes for inter-VM communications. The fine-grained scheduler 306 includes an optimizer 308 for performing a fine-grained optimization on a set of communicating VMs. The fine-grained optimization permits the scheduler to place VMs closer together based on communication patterns between VMs. This enables optimization of VMs that communicate at a higher rate with one or more other VMs.

During the fine-grained optimization, the scheduler 306 relocates a VM from one host in the cluster to another host in the cluster to optimize for application demands. This may be accomplished, in some examples, by co-locating two or more communicating VMs. Co-locating the VMs refers to moving the communicating VMs closer together within the cluster to minimize the distance between VMs. In some examples, a pair of VMs are co-located by placing the pair of VMs on the same host. In other examples, a pair of VMs are co-located by placing the VMs on different hosts within the same rack.

The application statistics collector 312 collects VM data associated with communications between the VMs. The application statistics collector 312 provides this data to the scheduler as a set of communication graphs. The set of communication graphs includes one or more VM-to-VM communication graphs. In some examples, a communication graph provides data regarding communications between two VMs.

In some examples, the application statistics collector 312 generates the communication graphs based on Internet Protocol Flow Information Export (IPFIX) records obtained from virtual switches running inside each host. These records are collected periodically, at regular intervals. In one example, the records are collected at one-minute intervals. These records are collected by an IPFIX collector service and then summarized into one or more communication graphs for the use of the fine-grained scheduler. In some examples, the scheduler analyzes the set of communication graphs to identify one or more VMs for optimization.
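
One way to picture the summarization step is the sketch below, which folds flow records into an undirected VM-to-VM traffic graph. The record layout is assumed for illustration; real IPFIX records carry addresses and counters that must first be resolved to VM identities:

    from collections import defaultdict

    def build_communication_graph(flow_records):
        # flow_records: iterable of (source_vm, dest_vm, byte_count).
        # Traffic is treated as undirected, so (a, b) and (b, a)
        # accumulate into the same edge of the graph.
        graph = defaultdict(int)
        for src, dst, nbytes in flow_records:
            key = tuple(sorted((src, dst)))
            graph[key] += nbytes
        return graph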

In other examples, the IPFIX collector service does not provide records of each and every VM-to-VM communication activity to the scheduler. In cases involving large numbers of VMs, the amount of inter-VM communications data may be prohibitively large. In such cases, the IPFIX collector service provides the communications records for the top VM candidates to be co-located based on inter-VM communications by the fine-grained scheduler. This minimizes the amount of VM communications data provided to the scheduler 300.

The optimizer 308 includes a penalty-based VM co-location algorithm to co-locate communicating VMs based on a penalty score. The penalty scores are generated by a penalty function based on communication patterns between the VMs. In some examples, the penalty function analyzes communication graphs between VMs to determine the penalty score for each VM being analyzed.

The penalty score indicates whether a VM communicates with one or more other VMs at a relatively higher rate or across a greater distance than other VMs in the cluster.

The penalty score in some examples indicates a distance between two or more communicating VMs. In some examples, the penalty score is proportional to the rate at which a pair of VMs is communicating. In still other examples, the penalty score is proportional to the rate at which a pair of VMs is communicating as well as the distance between the two VMs in the pair.

If the VMs are on the same host, the pair of VMs has a very small distance. The distance increases when the communications cross host boundaries within a rack. The distance increases again when the communications between VMs cross rack boundaries. Thus, VMs communicating across rack boundaries have a higher penalty score than a pair of VMs communicating across host boundaries within the same rack. Likewise, a penalty score for a pair of VMs communicating between different hosts is greater than the penalty score for communicating VMs on the same host.

In one non-limiting example, a first pair of VMs located far apart may have a lower penalty score than a second pair of VMs located a little closer together if the first pair of VMs communicates infrequently, while the second pair of VMs communicates more frequently. In such a case, the penalty scores are lowered by moving the second pair of VMs closer together rather than moving the first pair of VMs.
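
A minimal sketch of such a penalty function follows. The distance weights are illustrative placeholders; the disclosure fixes only the ordering (same host < same rack < cross rack), not particular values:

    SAME_HOST, SAME_RACK, CROSS_RACK = 1, 2, 4  # assumed example weights

    def vm_distance(placement, vm_a, vm_b):
        # placement maps a VM to its (host, rack) location.
        host_a, rack_a = placement[vm_a]
        host_b, rack_b = placement[vm_b]
        if host_a == host_b:
            return SAME_HOST
        if rack_a == rack_b:
            return SAME_RACK
        return CROSS_RACK

    def penalty(traffic_rate, distance):
        # The penalty grows with both the pair's communication rate and
        # its topological distance, so a chatty cross-rack pair scores
        # higher than a quiet one.
        return traffic_rate * distance

Under these assumed weights, a pair exchanging 10 units of traffic across racks scores 40, while a pair exchanging 30 units between hosts in one rack scores 60, so the closer but chattier pair is co-located first, matching the example above.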

Co-locating communicating VMs in some examples is a large system optimization. The fine-grained scheduler optimizes the overall VM-to-VM traffic matrix. However, when performing a full system optimization, the optimization may require a non-trivial amount of time and resources due to the large problem size of the VM-to-VM matrix optimization.

Moreover, the fine-grained optimizer in some examples works on a past snapshot of the traffic matrix. The fine-grained scheduler in such cases cannot converge to a final solution because the traffic matrix may change before an optimal solution is calculated. Moreover, a longer migration time makes convergence more difficult to achieve.

Therefore, in some examples, the fine-grained scheduler does not optimize the entire system in one run. Instead, the fine-grained scheduler works in a greedy fashion where it optimizes a small fraction of the traffic matrix in a given pass, such as in a candidate set of hosts or a candidate set of VMs. The fine-grained scheduler co-locates VMs in multiple passes such that the VM placement is at an approximate optimal state following two or more rounds of fine-grained optimization on two or more sets of VMs.

Alternatively, or in addition, the fine-grained scheduler is opportunistic. The scheduler in these examples only moves a communicating VM if there are sufficient resources available on the target host. The scheduler runs the dot product algorithm on the candidate set of hosts in the communication domain and picks the best available host.
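
A sketch of one greedy, opportunistic pass, under the assumptions that pairs arrive sorted by descending penalty score and that resource vectors are plain tuples (all names here are hypothetical):

    def colocation_moves(pairs, placement, available, demand, budget):
        # pairs: communicating VM pairs, highest penalty first.
        # placement maps a VM to its host; available maps a host to its
        # free-resource vector; demand maps a VM to its demand vector;
        # budget caps the number of migrations per pass.
        moves = []
        for vm_a, vm_b in pairs:
            if len(moves) >= budget:
                break
            target = placement[vm_b]  # the peer's host
            # Opportunistic: only move if the target has headroom in
            # every resource dimension.
            if all(a >= d for a, d in zip(available[target], demand[vm_a])):
                moves.append((vm_a, target))
        return moves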

In other examples, the fine-grained scheduler performs a cost-benefit analysis prior to migrating a VM to a different host. The cost-benefit analysis is performed to determine migration costs associated with moving the VM from the current host to a different host.

FIG. 4 is an exemplary block diagram illustrating a coarse-grained scheduler. The coarse-grained scheduler 400 selects a host for a new virtual computing instance or a migrating virtual computing instance.

In this non-limiting example, the coarse-grained scheduler 400 selects a host for a powering-on VM or for migrating a VM from one host to another host. The coarse-grained scheduler 400 receives resource statistics. The resource statistics include resource usage statistics 402, host resource capacity 404, and VM resource demands 406 for a plurality of hosts 408, a plurality of VMs 410 being hosted on the plurality of hosts 408, and/or an identified VM 434 to be placed on a host. An identified VM 434 may include a new powering-on VM or a VM to be relocated from one host to another host.

VMs have a few resource shapes, or resource skews. A given VM requires either high memory, high CPU, high networking, or a combination of these. For example, a VM may utilize a combination of high networking and high CPU due to the positive correlation between networking workload and the CPU cycles it consumes. Therefore, not all combinations are practical.

Moreover, in trace based experiments, utilization of the dot product algorithm by coarse-grained schedulers produced the best results in both packing quality and balance in the cluster. However, the dot product requires traversing all the available hosts and calculating the dot product in order to select a host which has the resource vector that best aligns with the VM's resource demand vector. This does not scale well with cluster size.

In some examples, a coarse-grained analyzer 436 performs a coarse-grained analysis 438 on a candidate set of hosts 432. The coarse-grained analysis includes running the dot product algorithm on the selected candidate set of hosts to select the best host for a particular VM based on optimizing one or more resources in the cluster. In this manner, the runtime of the scheduler to find a location for the VM is constant. Moreover, performing the dot product analysis on a sample of hosts picked from the cluster minimizes the pool of hosts that the dot product algorithm uses to select the best available host for the VM.

The candidate set of hosts is selected by a selection component 412 in some examples. As discussed above, both VMs and hosts have resource utilization shapes. The VMs' resource demand vectors and the hosts' available resource vectors have distinct shapes. The hosts in the cluster are organized into a set of queues 418 based on these resource shapes.

The set of queues 418 is a set of priority queues including an ordered list of hosts from the plurality of hosts. The set of queues 418 includes two or more queues providing an ordered list of hosts based on the resource shapes of the hosts.

In one non-limiting example, the set of queues provides ordered lists of hosts based on resource shapes for CPU and memory resources. In this example, the set of queues 418 includes three queues: a CPU queue 420, a memory queue 424, and a CPU+memory queue 428. The CPU+memory queue is a queue for a combination of resources.

In another example, the set of queues may include seven queues for CPU, network, and memory resources. In this example, the set of queues includes a CPU queue, a network queue, a memory queue, a CPU+memory queue, a CPU+network queue, a network+memory queue, and a CPU+memory+network queue.

In yet another example, the queues may include queues for CPU and network resources. In this example, the set of queues includes a CPU queue, a network queue, and a CPU+network queue.

In some examples, the hosts are listed in accordance with availability of the one or more resources. For example, queue 420 may include a list of hosts 422 ordered in accordance with the CPU resources associated with each host in the list. A host having the greatest CPU resource capacity is listed higher, or with greater priority, in the list than a host with less CPU resource capacity. Likewise, queue 424 may include a list of hosts 426 ordered in accordance with the memory resources available on each host. Queue 428 includes an ordered list of hosts 430 ordered in accordance with CPU and memory resources.

In other examples, hosts are listed in accordance with resource availability and a number of VMs on a host. For example, hosts may be listed in queues in accordance with a number of VMs on the hosts. A host with no VMs running on it is listed with a higher priority than a host with one or more VMs. In this example, the hosts with no VMs or hosts with only a single VM are selected out of each queue before hosts with two or more VMs having a same or similar resource availability.

In other examples, hosts are listed in accordance with resource availability and a distance between hosts. For example, hosts located on a particular rack or hosts on a same rack may be listed with a higher priority than hosts in different racks having a same or similar resource availability.

Hosts are added to each queue in the set of queues during initialization of the set of queues 418. When the cluster is initialized, the cluster manager calls the initialization code of the scheduler to build the priority queues.

In some examples, each host in the plurality of hosts is assigned to at least one queue. In this example, each host in the plurality of hosts is placed into at least one queue in the plurality of queues in an order corresponding to at least one resource associated with each host.

During the VM placement phase, the selection component 412 selects a predetermined number “K” of hosts 416 from each queue based on a multiqueue-K algorithm 414. The predetermined number of hosts 416 is a fixed number of hosts taken from each queue. In a non-limiting example, if the predetermined number of hosts 416 is three (3), the selection component 412 selects three hosts from queue 420, three hosts from queue 424, and three hosts from queue 428 for a candidate set of hosts 432 that includes nine (9) hosts. In operation, the multiqueue-K algorithm 414 selects the hosts from each queue, where “K” is the number of hosts the coarse-grained scheduler 400 pops from each priority queue in the set of queues 418 for one VM placement optimization.
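
A sketch of the multiqueue-K sampling step, assuming each priority queue is a binary heap keyed on negated available capacity (Python's heapq is a min-heap) and hosts are identified by name:

    import heapq

    def candidate_hosts(queues, k):
        # Pop the top K hosts from each resource-shape queue (e.g.,
        # CPU, memory, CPU+memory) to form the candidate set for one
        # placement decision. Entries are (-capacity, host_id) tuples.
        candidates = []
        for queue in queues:
            for _ in range(min(k, len(queue))):
                _, host_id = heapq.heappop(queue)
                candidates.append(host_id)
        return candidates

In practice, the popped hosts would be pushed back into their queues with updated capacities once the placement decision is made, consistent with the queue updates described below.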

In some examples, the predetermined number of hosts “K” is a user-configurable number of hosts. The higher the “K” value, the more accurate the host sampling. However, as the “K” value increases, the overhead costs also increase. The overhead costs refer to resource utilization for selecting a host, such as processor utilization, etc. Thus, the “K” value controls the expense of the search for a host.

In some examples, the predetermined number of hosts is selected by an administrator. In other examples, the predetermined number of hosts is a default value. In other examples, the predetermined number of hosts is determined based on time of day, day of the week, peak workload periods, or other factors. For example, during a peak workload period, the predetermined number of hosts may be a smaller number of hosts while a larger number of hosts is selected during off-peak hours, or vice-versa. Likewise, a different predetermined number of hosts may be selected on workdays than on weekends or holidays.

The number of queues grows exponentially with the number of resource types considered for placement. However, in some examples, utilizing two to four resource types is sufficient for resource-based optimization of placement decisions. Moreover, some queues may be eliminated from the set of queues by analyzing the VM resource shapes used in one or more datacenters.

The coarse-grained scheduler analyzes the candidate set of hosts to identify a host for a particular VM. In some examples, the analysis of the candidate set of hosts includes performing a dot product algorithm on the candidate set of hosts.

The dot product analysis is only performed with regard to hosts in the candidate set of hosts rather than analyzing all hosts in the cluster. Because this is not an exhaustive search, the selection process is more efficient.

The VM is placed on the selected host. The list of hosts in the set of queues 418 is updated to reflect changes in host resource capacity of the selected host 442 due to the placement of the identified VM onto the selected host. The set of queues may be updated on detection of a change to a resource capacity of a host. In other words, the set of queues may be updated when a powering-on VM is placed on a host or when a VM is migrated from one host to another host.

When a powering-on VM is placed on a selected host, the selected host's priority listing in a queue is updated to reflect the changes in the host's reduced resource availability. When a VM is migrated off a first host and onto a second host, the set of queues is updated to reflect the resources released on the first host and the resources of the second host taken up by the VM as a result of the VM relocation.

In other examples, the set of queues 418 is updated periodically upon occurrence of an update time interval. The update time interval may be a default amount of time, a user-defined amount of time, or any other time interval. This enables updates of the set of queues at regular intervals. Updating a queue in the set of queues 418, in some examples, has a cost of log h steps, where “h” is the number of hosts in the cluster.

In still other examples, after placing a VM on a selected host, the coarse-grained scheduler 400 updates the host resource capacity 404 for the selected host. The resource capacity for the selected host is updated to indicate resources consumed by the identified VM, such as, but not limited to, the CPU, memory, networking, and/or storage resources consumed by the identified VM placed on the selected host. This may be accomplished by updating the host's available resource vector to deduct resources reserved or allocated to the identified VM.
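
A one-function sketch of that bookkeeping, using the same hypothetical host representation as the dot product example above:

    def deduct_placement(host, vm_demand):
        # After placement, subtract the VM's demand from the host's
        # available-resource vector so subsequent placement decisions
        # see the reduced capacity.
        host["available"] = tuple(
            a - d for a, d in zip(host["available"], vm_demand)
        )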

FIG. 5 is an exemplary block diagram illustrating a fine-grained scheduler. The goal of the fine-grained scheduler 500 is to optimize for communications between virtual computing instances. More specifically, in this example, the fine-grained scheduler 500 optimizes inter-VM communications.

In this non-limiting example, the fine-grained scheduler performs a penalty-based optimization in which a penalty score for a set of two or more communicating VMs 504 is reduced by migrating one or more of the VMs to a different host to co-locate the VMs. Co-locating the VMs minimizes the communications distance between the VMs.

The fine-grained scheduler 500 in this example receives a set of communication graphs 502 associated with a plurality of VMs. The set of communication graphs 502 includes one or more communication graphs associated with two or more communicating VMs. A penalty function 506 is utilized to analyze the set of communication graphs 502 to generate a set of penalty scores for the plurality of VMs.

The fine-grained analysis 510 analyzes the set of penalty scores 508 to identify the set of communicating VMs 504 from the plurality of VMs for optimization. In some examples, the set of communicating VMs are the VMs with the highest penalty score(s). These VMs are the top network traffic utilizers.

In other examples, the set of communicating VMs are VMs having a penalty score that exceeds a threshold penalty score. In this example, a penalty score for each pair of VMs is compared to the threshold score to determine whether to co-locate the pair of VMs.

In other examples, the set of communicating VMs 504 includes VMs having a higher priority than other VMs in the plurality of VMs. In these examples, VMs with lower penalty scores are optimized prior to VMs with higher penalty scores if the lower penalty score VMs have a higher priority than the other VMs. For example, VMs associated with an application may be co-located prior to VMs running other lower priority applications.

In still other examples, the set of communicating VMs 504 includes two or more VMs having the same group tag. The VMs with the same group tag in this example are co-located together.

In some examples, the penalty function 506 is utilized to analyze communication graphs for a candidate set of VMs or VMs located on a candidate set of hosts. The optimizer 512 utilizes the penalty scores generated based on the communication graphs to select a set of communicating VMs 504 for co-location.

In some examples, the set of communicating VMs 504 includes a single pair of communicating VMs, such as VM1 and VM2 having one penalty score for the pair of VMs. In another example, the set of communicating VMs may include three VMs, such as a first pair of VMs (VM1 and VM2) having a first penalty score and a second pair of VMs (VM2 and VM3) having a second penalty score. In yet another example, the set of communicating VMs may include four VMs, such as a first pair of VMs (VM1 and VM2) having a first penalty score and a second pair of VMs (VM3 and VM4) having a second penalty score.

The fine-grained scheduler 500 may attempt to co-locate two or more communicating VMs in closer proximity to one another. This may be accomplished by placing the communicating VMs onto the same host. From a pure communication point of view, it is desirable for the fine-grained scheduler 500 to co-locate all VMs which are communicating with each other on a single host. However, the fine-grained scheduler may not be able to fit all the communicating VMs in a single application onto a single host. Moreover, if all communicating VMs associated with a given application are placed on the same host, it limits the failure domain of the application to a single host.

In other examples, the fine-grained scheduler 500 attempts to co-locate communicating VMs to a single rack. Co-locating all communicating VMs to the same rack results in inter-host VM traffic going through the hypervisor networking stack. However, the traffic stays within the same top-of-rack switch, which has higher capacity than the inter-rack links. Also, because racks have many hosts, in some examples, it is easier for the fine-grained scheduler 500 to find available spaces, or holes, to fit all the VMs in the communicating group.

The fine-grained scheduler 500 attempts to co-locate the VMs to the same host or the same rack using a penalty function 506. The penalty function 506 analyzes communication graphs in the set of communication graphs 502 to generate a set of penalty scores 508. The set of penalty scores 508 includes one or more penalty scores.

The fine-grained scheduler 500 minimizes the penalty score associated with a given set of communicating VMs by performing migrations of one or more of the VMs. The fine-grained scheduler 500 migrates at least one VM in the set of communicating VMs to a different host to minimize the distance between two or more of the communicating VMs and/or reduce at least one penalty score.

The penalty score in some examples is reduced by performing a single VM migration. For example, a first VM on a first host communicating with a second VM on a second host may be moved from the first host to the second host to achieve optimization. The penalty score is reduced in other examples by performing two or more VM migrations. For example, the first VM may be moved from the first host to a third host and the second VM may be moved from the second host to the third host to co-locate the first and second VMs on the same host.

Thus, the two-tiered scheduler, including the coarse-grained scheduler and the fine-grained scheduler, is invoked in some examples when a new VM is powered on in the cluster or when a VM is moved into the cluster. For example, when a VM is powered on, the scheduler is invoked to locate the host for the VM. The priority queues are already built up by the initialization code. The scheduler pops the predetermined number “K” of hosts from each queue to create a candidate set of hosts. The scheduler finds the best host in the candidate set of hosts based on the dot product. The scheduler updates the queues to indicate the VM placement.

In other examples, the two-tiered scheduler is invoked periodically for remediation of resource-wise distressed VMs and co-location of communicating VMs. For example, when the scheduler is invoked periodically, it identifies a set of top distressed VMs in the cluster based on the penalty scores for the VMs. The scheduler performs the multiqueue-K algorithm to find a suitable host for each of the VMs in the set of VMs. The scheduler attempts to co-locate one or more of the communicating VMs by invoking the fine-grained optimizer. The scheduler queries one or more IPFIX based communication graph(s) to identify top communicating VMs and applies the penalty function for these VMs to generate penalty scores. Based on the penalty scores and the migration budget the scheduler is allowed, the scheduler picks the top candidates for co-location.

In other examples, upon completion of the co-location, the scheduler marks the VMs which are communicating with each other with a tag to indicate that each VM belongs to a communication group. The tag is placed on the communicating VMs so that during the next scheduler optimization passes, the tagged VMs have less chance of getting migrated for other reasons, such as for remediation of an over-utilized host. During the first phase of the remediation/load-balancing pass, when moving a distressed VM to a different host, the scheduler avoids moving the VMs with a group tag.

In some examples, VMs without a group tag are moved preferentially over VMs with a group tag. In other examples, VMs with a group tag are not moved. In still other examples, VMs with a group tag are moved simultaneously with one or more VMs having the same group tag to the same host or the same rack. In yet other examples, a VM with a group tag is only moved to a host that will place the VM into closer proximity to one or more other VMs with the same group tag.

FIG. 6 is an exemplary flow chart illustrating operation of a two-tiered scheduler. The process shown in FIG. 6 may be performed by a scheduler executed by a computing device, such as, but not limited to, the scheduler 108 in FIG. 1 or the scheduler 300 in FIG. 3. The computing device may be implemented as a computing device such as, but not limited to, a server in the set of servers 112 in FIG. 1, the host computing device 200 in FIG. 2, the host computing device 1500 in FIG. 15, or the host computing device 1600 in FIG. 16. Further, execution of the operations illustrated in FIG. 6 is not limited to a scheduler. One or more computer-readable storage media storing computer-readable instructions may execute to cause at least one processor to implement the operations illustrated in FIG. 6.

A candidate set of hosts is selected at 602. A determination is made as to whether to perform a coarse-grained analysis at 604. If yes, a coarse-grained, resource-based optimization is performed on a candidate set of hosts to select a host at 606. An identified VM is placed on the selected host at 608.

A determination is made as to whether to perform a fine-grained optimization at 610. If no, the process terminates thereafter. If yes, penalty scores for a set of VMs are analyzed at 612. A first VM in the set of VMs is relocated from a first host to a second host, based on the penalty score analysis, to minimize a distance between the first VM and a second VM and co-locate the VMs. Co-locating the first VM and the second VM reduces at least one penalty score at 616. The process terminates thereafter.

The process in FIG. 6 is described as being implemented to perform scheduling of VMs. However, in other examples, the process is implemented for scheduling with regard to containers.

While the operations illustrated in FIG. 6 are described as being performed by a host computing device or a server, aspects of the disclosure contemplate performance of the operations by other entities. For example, a cloud service associated with a cloud, such as the cloud 102 in FIG. 1, may perform one or more of the operations.

FIG. 7 is an exemplary flow chart illustrating operation of a coarse-grained scheduler. The process shown in FIG. 7 may be performed by a scheduler on a computing device, such as, but not limited to, the scheduler 108 in FIG. 1, the scheduler 300 in FIG. 3, or the coarse-grained scheduler 400 in FIG. 4. The computing device may be implemented as a computing device such as, but not limited to, a server in the set of servers 112 in FIG. 1, the host computing device 200 in FIG. 2, the host computing device 1500 in FIG. 15, or the host computing device 1600 in FIG. 16. Further, execution of the operations illustrated in FIG. 7 is not limited to a scheduler. One or more computer-readable storage media storing computer-readable instructions may execute to cause at least one processor to implement the operations illustrated in FIG. 7.

An identification of a VM to be placed on a host is received at 702. The VM to be placed on a host may be a powering-on VM or a VM that is being moved from one host to another host. A predetermined number of hosts is selected from a plurality of queues to generate a candidate set of hosts at 704. Resource statistics for a set of VMs and the candidate set of hosts are received at 706. The candidate set of hosts is analyzed using coarse-grained, resource-based optimization at 708. A host is selected based on the analysis results at 710. An identified VM is placed on the selected host at 712. The process terminates thereafter.

In this example, a predetermined number of hosts is selected from a plurality of queues. In some examples, the predetermined number of hosts is selected from each queue in the plurality of queues to generate the candidate set of hosts.

The process in FIG. 7 is described as being implemented to perform coarse-grained optimizations with regard to VMs. However, in other examples, the optimizations are performed with regard to containers.

While the operations illustrated in FIG. 7 are described as being performed by a host computing device or a server, aspects of the disclosure contemplate performance of the operations by other entities. For example, a cloud service associated with a cloud, such as the cloud 102 in FIG. 1, may perform one or more of the operations.

FIG. 8 is an exemplary flow chart illustrating operation of a fine-grained scheduler. The process shown in FIG. 8 may be performed by a scheduler on a computing device, such as, but not limited to, the scheduler 108 in FIG. 1, the scheduler 300 in FIG. 3, or the fine-grained scheduler 500 in FIG. 5. The computing device may be implemented as a computing device such as, but not limited to, a server in the set of servers 112 in FIG. 1, the host computing device 200 in FIG. 2, the host computing device 1500 in FIG. 15, or the host computing device 1600 in FIG. 16. Further, execution of the operations illustrated in FIG. 8 is not limited to a scheduler. One or more computer-readable storage media storing computer-readable instructions may execute to cause at least one processor to implement the operations illustrated in FIG. 8.

A set of communication graphs associated with a plurality of VMs is analyzed based on a penalty function at 802. In this example, a set of penalty scores is generated based on the communication graph analysis at 804. A set of VMs is selected based on the set of penalty scores at 806. A first VM in the set of VMs is relocated from a first host to a second host at 808. A determination is made as to whether at least one penalty score is reduced at 810. If yes, the process terminates thereafter.

Returning to 810, if at least one penalty score is not reduced, the process returns to 802. The process iteratively executes 802-810 until at least one penalty score is reduced at 810. The process terminates thereafter.
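
As a rough, self-contained illustration of operations 802 through 810, the sketch below assumes a penalty equal to traffic rate multiplied by placement distance, relocates the first VM of the worst-scoring pair onto the other VM's host, and repeats until a penalty score drops. The penalty model, the data shapes, and the omission of host-capacity checks are simplifying assumptions, not the disclosure's exact method.

    def distance(placement, racks, vm_a, vm_b):
        """Assumed distance: 1 intra-host, 5 inter-host, 25 inter-rack."""
        if placement[vm_a] == placement[vm_b]:
            return 1
        if racks[placement[vm_a]] == racks[placement[vm_b]]:
            return 5
        return 25

    def penalty_scores(edges, placement, racks):
        """edges: {(vm_a, vm_b): traffic_rate} from the communication graphs."""
        return {pair: rate * distance(placement, racks, *pair)
                for pair, rate in edges.items()}

    def fine_grained_pass(edges, placement, racks, max_rounds=10):
        for _ in range(max_rounds):                    # bound the 802-810 loop
            scores = penalty_scores(edges, placement, racks)         # 802-804
            pair, worst = max(scores.items(), key=lambda kv: kv[1])  # 806
            vm_a, vm_b = pair
            placement[vm_a] = placement[vm_b]          # 808: co-locate the pair
            if penalty_scores(edges, placement, racks)[pair] < worst:
                return placement                       # 810: score reduced
        return placement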

The process in FIG. 8 is described as being implemented to perform fine-grained optimizations with regard to VMs. However, in other examples, the fine-grained optimizations are performed with regard to containers.

While the operations illustrated in FIG. 8 are described as being performed by a host computing device or a server, aspects of the disclosure contemplate performance of the operations by other entities. For example, a cloud service associated with a cloud, such as cloud 102 in FIG. 1, may perform one or more of the operations.

Thus, the two-tiered resource scheduler in some examples performs initial placement, remediation of resource contention, as well as co-location of communicating VMs to improve application performance in a cluster. The coarse-grained scheduler performs initial placement and resource contention remediation based on resource shape, sampling, and vector dot product. The fine-grained, co-location scheduler uses a distance- and throughput-based penalty function to identify and greedily co-locate communicating VMs.
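
The vector dot product mentioned above can be read as matching a VM's demand "shape" against a host's remaining-capacity "shape". The sketch below is one plausible formulation under that reading; the resource names, the per-capacity normalization, and the feasibility filter are assumptions rather than details from the disclosure.

    RESOURCES = ("cpu", "mem", "net")

    def dot_score(demand, free, capacity):
        """Dot product of the normalized demand and free-capacity vectors."""
        return sum((demand[r] / capacity[r]) * (free[r] / capacity[r])
                   for r in RESOURCES)

    def pick_host(demand, candidates):
        """candidates: list of (name, free, capacity) tuples; free and
        capacity are per-resource dicts. Returns the best-fitting host."""
        feasible = [c for c in candidates
                    if all(c[1][r] >= demand[r] for r in RESOURCES)]
        if not feasible:
            return None                     # no candidate can host the VM
        best = max(feasible, key=lambda c: dot_score(demand, c[1], c[2]))
        return best[0]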

A combination of simulation results and cluster experiments may be used to highlight gaps in current schedulers and demonstrate the strengths of the coarse-grained scheduler and the fine-grained scheduler. To evaluate performance of a scheduler in a large scale system, a trace-driven simulator is used with sequences of snapshots from internal NIMBUS clusters containing more than one hundred (100) hosts and one thousand (1,000) VMs. A snapshot contains each VM's resource requirements, hosts' resource capacities, and other static information, such as, for example, the current VM-to-host mapping.

The simulator mimics the manner in which different coarse-grained optimization algorithms make VM-to-host mapping decisions using information available from a snapshot. Imbalance across hosts, cluster total utilization, and algorithm runtime are used to evaluate the effectiveness of an algorithm. To evaluate an algorithm at different system sizes, the snapshot is scaled horizontally. FIG. 9, FIG. 10, and FIG. 11 below illustrate how the sampling-based multiqueue-K algorithm for virtual computing instance scheduling compares to other algorithms when evaluated using this simulator.
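
The text does not define these metrics precisely, so the following is one reasonable reading: total utilization as placed demand over aggregate capacity, and imbalance as the standard deviation of each host's normalized load (its entitlement). The data shapes are hypothetical.

    from statistics import pstdev

    def total_utilization(hosts):
        """hosts: list of {'used': float, 'capacity': float}; returns percent."""
        return 100.0 * sum(h["used"] for h in hosts) / sum(h["capacity"] for h in hosts)

    def imbalance(hosts):
        """Standard deviation of per-host normalized load; lower is more balanced."""
        return pstdev(h["used"] / h["capacity"] for h in hosts)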

FIG. 9 is an exemplary graph illustrating system utilization of different scheduling algorithms under different system sizes. The graph 900 shows total utilization in percentage (%) along the vertical y-axis and the various algorithms along the horizontal x-axis. The schedulers in this example include a random scheduler 902 as a lower bound, a dot product 904 without fixed-size sampling, a state-of-the-art cluster scheduler (dot-rand-32) 906, a dot product with multiqueue-K algorithm 908 using fixed-size host sampling, and a network aware distributed resource scheduler (DRS) 910. The multiqueue-K algorithm 908 in this example uses a “K” value of four (4). The multiqueue-K algorithm 908 pops the top four compatible host candidates from a set of queues.

The different system sizes in this example include 1,600 VMs on 64 hosts identified in the graph by an “A”; 6,250 VMs on 250 hosts identified by “B”; 25,000 VMs on 1,000 hosts indicated by “C”; and 100,000 VMs on 4,000 hosts indicated by “D” for each different algorithm.

FIG. 10 is an exemplary graph illustrating system imbalance of different scheduling algorithms under different system sizes. The graph 1000 shows imbalance (standard deviation of normalized entitlement) along the vertical y-axis and the various algorithms along the horizontal x-axis. The algorithms in this example include the random scheduler 902, the dot product 904 without fixed-size sampling, the state-of-the-art cluster scheduler 906, the multiqueue-K algorithm 908 having a “K” value of four, and the network aware DRS 910. The different system sizes in this example include 640 VMs on 64 hosts indicated in the graph by “A”; 2,500 VMs on 250 hosts identified by a “B”; 10,000 VMs on 1,000 hosts identified by “C”; and 40,000 VMs on 4,000 hosts indicated by a “D” for each different algorithm.

FIG. 11 is an exemplary graph illustrating algorithm runtime of different scheduling algorithms under different system sizes. The graph 1100 shows algorithm runtime along the vertical y-axis and the various algorithms along the horizontal x-axis. The algorithms in this example include the random scheduler 902 as a lower bound, the dot product 904 without fixed-size sampling, the state-of-the-art cluster scheduler 906, the multiqueue-K algorithm 908 with a “K” value of 4, and the network aware DRS 910.

The different system sizes in this example include 1,600 VMs on 64 hosts shown by an “A”; 6,250 VMs on 250 hosts identified by “B”; 25,000 VMs on 1,000 hosts identified by a “C”; and 100,000 VMs on 4,000 hosts identified in the graph by a “D” for each different algorithm.

As shown in FIG. 9, FIG. 10, and FIG. 11 above, the dot product 904 algorithm achieves the highest utilization and the lowest imbalance, but because it compares all VMs and hosts, its runtime increases faster than that of the dot product with sampling and the multiqueue algorithms as the system size increases. Applying pure sampling to the dot product improves the runtime scalability. It also achieves similarly high utilization as the original algorithm. However, because it does not always find the best candidates, it sacrifices balance.

The multiqueue-K algorithm 908 combines the advantages of sampling and exhaustive search. It achieves similarly high utilization and low imbalance as the original dot product 904 algorithm. Although the runtime of the multiqueue algorithm 908 is around four times (4×) higher than pure sampling, it scales very well compared with the original dot product 904 algorithm.

The network aware DRS 910 also achieves very low imbalance because it was designed to reduce imbalance. However, its exhaustive search increases the algorithm runtime drastically. Moreover, the DRS 910 cannot complete for the largest configuration due to its exhaustive search algorithm. In the examples shown in FIG. 9, FIG. 10, and FIG. 11, there are no bars indicated by “D” for the DRS 910.

The graphs in FIG. 12 and FIG. 13 below illustrate how the predetermined number of hosts “K” influences the multiqueue algorithm. FIG. 12 is an exemplary graph illustrating algorithm runtime of multiqueue-K algorithms using different top K candidates. The vertical y-axis of the graph 1200 shows time in seconds depicted in log scale. The horizontal x-axis identifies the scheduling algorithms. The algorithms in this example include a multiqueue-K algorithm with a “K” value of 1 at 1202 for 1 top candidate; a multiqueue-K algorithm with a “K” value of 2 that pops 2 top candidates at 1204; a multiqueue-K algorithm with a “K” value of 4 at 1206; a multiqueue-K algorithm with a “K” value of 8 at 1208; a multiqueue-K algorithm with a “K” value of 16 at 1210; a multiqueue-K algorithm with a “K” value of 32 at 1212; a state-of-the-art cluster scheduler (dot-rand-32) at 1214; and a dot product algorithm at 1216.

FIG. 13 is an exemplary graph illustrating system utilization of multiqueue-K algorithms using different top “K” candidates. The graph 1300 includes a vertical y-axis for total percentage (%) utilization. The horizontal x-axis identifies the scheduling algorithms. The algorithms in this example include a multiqueue-K algorithm for 1 top candidate at 1202; a multiqueue-K algorithm with a “K” value of 2 at 1204; a multiqueue-K algorithm with a “K” value of 4 at 1206; a multiqueue-K algorithm with a “K” value of 8 at 1208; a multiqueue-K algorithm with a “K” value of 16 at 1210; a multiqueue-K algorithm with a “K” value of 32 at 1212; a state-of-the-art cluster scheduler (dot-rand-32) at 1214; and a dot product algorithm at 1216.

The graphs 1200 and 1300 illustrate how system utilization and algorithm runtime change as the number of candidates the algorithm considers in each queue is altered. A “K” value of 4, which results in comparing the top four candidates in each queue, achieves most of the benefits while adding acceptable constant runtime overhead.

In FIG. 14 below, the same trace-driven simulator using sequences of snapshots is used to evaluate fine-grained optimization algorithms. Distance cost is defined as one (1) for intra-host, five (5) for inter-host, and twenty-five (25) for inter-rack, and the total communication cost of a VM-to-host mapping is used as the metric to evaluate the algorithms. The snapshots are gathered from production clusters. The snapshots do not contain VM communication graphs. Therefore, the snapshots are combined with customized snapshots containing the communication graphs of a REDIS web application deployed with different numbers of VMs.
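
With the distance costs just stated, the evaluation metric can be expressed compactly. This reuses the hypothetical distance() helper from the fine-grained sketch above and assumes the communication graph is given as a map from VM pairs to traffic rates.

    def total_communication_cost(edges, placement, racks):
        """Sum of traffic rate times distance over all communicating VM pairs;
        lower is better (1 intra-host, 5 inter-host, 25 inter-rack)."""
        return sum(rate * distance(placement, racks, vm_a, vm_b)
                   for (vm_a, vm_b), rate in edges.items())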

FIG. 14 is an exemplary graph illustrating reduction of communication cost of different scheduling algorithms over a random scheduler. The graph 1400 shows how the penalty-aware algorithm improves application performance. The vertical y-axis is a sum of communication cost in percentage (%) normalized to the baseline.

Five algorithms are shown along the horizontal x-axis. The algorithms include a dot product algorithm, indicated by an “A”, without any fine-grained optimization; a greedy algorithm that tries to co-locate communicating VMs on a single host, identified by a “B”; a greedy algorithm that tries to co-locate communicating VMs on a single rack, identified by a “C”; a state-of-the-art, greedy-based VM scheduler, identified by a “D”; and a penalty-aware algorithm co-locating VMs based on penalty scores, identified by an “E”.

The algorithms in this example are evaluated by comparing their total communication costs. The total communication costs are normalized to the dot product without any fine-grained optimization. The number of communicating VMs is varied within a web application to see how the difference in communication graphs affects algorithm performance. In this example, the set of communicating VMs includes three (3) VMs at 1402; six (6) VMs at 1404; and nine (9) VMs at 1406.

When the number of VMs in the set is small (e.g., 3), co-locating the VMs onto a single host is very effective because it is more likely that enough space is available on a single host to accommodate all the VMs in a small group, as shown by B at 1402. However, when the number of VMs in the group is larger and the size of the communication graph increases, it is less likely that all the VMs in the group can be co-located onto a single host. Therefore, co-location in a single rack becomes more effective with a larger group of VMs than attempting to co-locate on a single host, as indicated by B and C at 1404 and 1406.
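
The host-versus-rack trade-off described above can be pictured as a greedy fallback: try to fit the whole communicating group on one host, and settle for a single rack when no host has room. The sketch below uses a single scalar capacity per host purely for illustration; the function and data shapes are hypothetical.

    def choose_target(group_demand, free, racks):
        """free: {host: free_capacity}; racks: {rack: [host, ...]}."""
        for host, capacity in free.items():
            if capacity >= group_demand:
                return ("host", host)      # small groups usually fit here
        for rack, members in racks.items():
            if sum(free[h] for h in members) >= group_demand:
                return ("rack", rack)      # larger groups spread over a rack
        return ("none", None)              # no co-location target available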

The state-of-the-art, greedy-based VM scheduler is stable across different cluster sizes. However, it performs coarse-grained and fine-grained optimization within a single pass, and thus reduces the effectiveness of both parts.

The penalty-aware algorithm shown at E performs equally well at both small and large group sizes. The penalty-aware algorithm performs well because it adapts to different schemes automatically.

The examples shown above are described as being implemented to place and migrate VMs. However, in other examples, the scheduler is implemented to place and migrate containers, or other virtual computing instances.

FIG. 15 is a block diagram of an example host computing device. A host computing device 1500 includes a processor 1502 for executing instructions. In some examples, executable instructions are stored in a memory 1504. Memory 1504 is any device allowing information, such as, but not limited to, executable instructions, to be stored and retrieved. For example, memory 1504 may include one or more random access memory (RAM) modules, flash memory modules, hard disks, solid state disks, and/or optical disks.

Host computing device 1500 may include a user interface device 1510 for receiving data from a user 1508 and/or for presenting data to user 1508. User 1508 may interact indirectly with host computing device 1500 via another computing device such as VMware's vCenter Server or another management device. User interface device 1510 may include, for example, a keyboard, a pointing device, a mouse, a stylus, a touch sensitive panel (e.g., a touch pad or a touch screen), a gyroscope, an accelerometer, a position detector, and/or an audio input device.

In some examples, the user interface device 1510 operates to receive data from the user 1508, while another device (e.g., a presentation device) operates to present data to user 1508. In other examples, the user interface device 1510 has a single component, such as a touch screen, that functions to both output data to user 1508 and receive data from the user 1508. In such examples, the user interface device 1510 operates as a presentation device for presenting information to user 1508. In such examples, the user interface device 1510 represents any component capable of conveying information to user 1508. For example, the user interface device 1510 may include, without limitation, a display device (e.g., a liquid crystal display (LCD), organic light emitting diode (OLED) display, or “electronic ink” display) and/or an audio output device (e.g., a speaker or headphones). In some examples, the user interface device 1510 includes an output adapter, such as a video adapter and/or an audio adapter. An output adapter is operatively coupled to the processor 1502 and configured to be operatively coupled to an output device, such as a display device or an audio output device.

The host computing device 1500 also includes a network communication interface 1512, which enables the host computing device 1500 to communicate with a remote device (e.g., another computing device) via a communication medium, such as a wired or wireless packet network. For example, the host computing device 1500 may transmit and/or receive data via the network communication interface 1512. The user interface device 1510 and/or network communication interface 1512 may be referred to collectively as an input interface and may be configured to receive information from the user 1508.

The host computing device 1500 further includes a storage interface 1516 that enables the host computing device 1500 to communicate with one or more data stores, which store virtual disk images and/or software applications suitable for use with the methods described herein. In some examples, the storage interface 1516 couples the host computing device 1500 to a storage area network (SAN) (e.g., a Fibre Channel network) and/or to a network-attached storage (NAS) system (e.g., via a packet network). The storage interface 1516 may be integrated with the network communication interface 1512.

FIG. 16 depicts a block diagram of VMs 1635-1, 1635-2 . . . 1635-N that are instantiated on host computing device 1600. The host computing device 1600 includes a hardware platform 1605, such as an x86 architecture platform. The hardware platform 1605 may include a processor 1602, memory 1604, network communication interface 1612, user interface device 1610, and other input/output (I/O) devices, such as a presentation device 1606. A virtualization software layer is installed on top of the hardware platform 1605.

The virtualization software layer supports a VM execution space 1630 within which multiple VMs (VMs 1635-1 through 1635-N) may be concurrently instantiated and executed. Hypervisor 1610 includes a device driver layer 1615, and maps physical resources of the hardware platform 1605 (e.g., processor 1602, memory 1604, network communication interface 1612, and/or user interface device 1610) to “virtual” resources of each of the VMs 1635-1 through 1635-N such that each of the VMs 1635-1 through 1635-N has its own virtual hardware platform (e.g., a corresponding one of virtual hardware platforms 1640-1 through 1640-N), each virtual hardware platform having its own emulated hardware (such as a processor 1645, a memory 1650, a network communication interface 1655, a user interface device 1660, and other emulated I/O devices in VM 1635-1).

Hypervisor 1610 may manage (e.g., monitor, initiate, and/or terminate) execution of VMs 1635-1 through 1635-N according to policies associated with hypervisor 1610, such as a policy specifying that VMs 1635-1 through 1635-N are to be automatically respawned upon unexpected termination and/or upon initialization of hypervisor 1610. In addition, or alternatively, the hypervisor 1610 may manage execution of VMs 1635-1 through 1635-N based on requests received from a device other than host computing device 1600. For example, the hypervisor 1610 may receive an execution instruction specifying the initiation of execution of first VM 1635-1 from a management device via the network communication interface 1612 and execute the execution instruction to initiate execution of first VM 1635-1.

In some examples, the memory 1650 in the first virtual hardware platform 1640-1 includes a virtual disk that is associated with or “mapped to” one or more virtual disk images stored on a disk (e.g., a hard disk or solid state disk) of the host computing device 1600. The virtual disk image represents a file system (e.g., a hierarchy of directories and files) used by the first VM 1635-1 in a single file or in a plurality of files, each of which includes a portion of the file system. In addition, or alternatively, virtual disk images may be stored on one or more remote computing devices, such as in a storage area network (SAN) configuration. In such examples, any quantity of virtual disk images may be stored by the remote computing devices.

The device driver layer 1615 includes, for example, a communication interface driver 1620 that interacts with the network communication interface 1612 to receive and transmit data from, for example, a LAN connected to the host computing device 1600. The communication interface driver 1620 also includes a virtual bridge 1625 that simulates the broadcasting of data packets in a physical network received from one communication interface (e.g., network communication interface 1612) to other communication interfaces (e.g., the virtual communication interfaces of VMs 1635-1 through 1635-N). Each virtual communication interface for each VM 1635-1 through 1635-N, such as the network communication interface 1655 for the first VM 1635-1, may be assigned a unique virtual MAC address that enables virtual bridge 1625 to simulate the forwarding of incoming data packets from the network communication interface 1612. In an example, the network communication interface 1612 is an Ethernet adapter that is configured in “promiscuous mode” such that all Ethernet packets that it receives (rather than just Ethernet packets addressed to its own physical MAC address) are passed to virtual bridge 1625, which, in turn, is able to further forward the Ethernet packets to VMs 1635-1 through 1635-N. This configuration enables an Ethernet packet that has a virtual MAC address as its destination address to properly reach the VM in the host computing device 1600 with a virtual communication interface that corresponds to such virtual MAC address.
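
As a toy model only of the forwarding behavior described above, the sketch below keeps a table from virtual MAC addresses to per-VM virtual interfaces, delivering unicast frames to the matching VM and flooding unknown or broadcast destinations. The class and method names are illustrative; a real hypervisor bridge is substantially more involved.

    class VirtualBridge:
        """Maps destination virtual MACs to VM virtual interfaces."""

        def __init__(self):
            self.table = {}                       # virtual MAC -> VM interface

        def attach(self, mac, vnic):
            """Register a VM's virtual NIC; vnic needs a receive(frame) method."""
            self.table[mac] = vnic

        def forward(self, dst_mac, frame):
            vnic = self.table.get(dst_mac)
            if vnic is not None:
                vnic.receive(frame)               # unicast to the matching VM
            else:
                for vnic in self.table.values():  # unknown/broadcast: flood
                    vnic.receive(frame)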

The virtual hardware platform 1640-1 may function as an equivalent of a standard x86 hardware architecture such that any x86-compatible desktop operating system (e.g., Microsoft WINDOWS brand operating system, LINUX brand operating system, SOLARIS brand operating system, NETWARE, or FREEBSD) may be installed as guest operating system (OS) 1665 in order to execute applications 1670 for an instantiated VM, such as the first VM 1635-1. The virtual hardware platforms 1640-1 through 1640-N may be considered to be part of the VM monitors (VMMs) 1675-1 through 1675-N that implement virtual system support to coordinate operations between the hypervisor 1610 and corresponding VMs 1635-1 through 1635-N. Those with ordinary skill in the art will recognize that the various terms, layers, and categorizations used to describe the virtualization components in FIG. 16 may be referred to differently without departing from their functionality or the spirit or scope of the disclosure. For example, the virtual hardware platforms 1640-1 through 1640-N may also be considered to be separate from the VMMs 1675-1 through 1675-N, and the VMMs 1675-1 through 1675-N may be considered to be separate from hypervisor 1610. One example of the hypervisor 1610 that may be used in an example of the disclosure is included as a component in VMware's ESX brand software, which is commercially available from VMware, Inc.

Certain examples described herein involve a hardware abstraction layer on top of a host computer (e.g., server). The hardware abstraction layer allows multiple containers to share the hardware resources. These containers, isolated from each other, have at least a user application running therein. The hardware abstraction layer thus provides benefits of resource isolation and allocation among the containers. In some examples, VMs may be used alternatively or in addition to the containers, and hypervisors may be used for the hardware abstraction layer. In these examples, each VM generally includes a guest operating system in which at least one application runs.

For the container examples, it should be noted that the disclosure applies to any form of container, such as containers not including a guest operating system, referred to herein as “OS-less containers” (see, e.g., www.docker.com). OS-less containers implement operating system-level virtualization, wherein an abstraction layer is provided on top of the kernel of an operating system on a host computer. The abstraction layer supports multiple OS-less containers, each including an application and its dependencies. Each OS-less container runs as an isolated process in user space on the host operating system and shares the kernel with other containers. The OS-less container relies on the kernel's functionality to make use of resource isolation (CPU, memory, block I/O, network, etc.) and separate namespaces and to completely isolate the application's view of the operating environment. By using OS-less containers, resources may be isolated, services restricted, and processes provisioned to have a private view of the operating system with their own process ID space, file system structure, and network interfaces. Multiple containers may share the same kernel, but each container may be constrained to use only a defined amount of resources such as CPU, memory, and I/O.

Exemplary Operating Environment

Exemplary computer readable media include flash memory drives, digital versatile discs (DVDs), compact discs (CDs), floppy disks, and tape cassettes. By way of example and not limitation, computer readable media comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules and the like. Computer storage media are tangible and mutually exclusive to communication media. Computer storage media are implemented in hardware and exclude carrier waves and propagated signals. Computer storage media for purposes of this disclosure are not signals per se. Exemplary computer storage media include hard disks, flash drives, and other solid-state memory. In contrast, communication media typically embody computer readable instructions, data structures, program modules, or the like, in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media.

Although described in connection with an exemplary computing system environment, examples of the disclosure are capable of implementation with numerous other general purpose or special purpose computing system environments, configurations, or devices. In some examples, the computing system environment includes a first computer system at a first site and/or a second computer system at a second site. The first computer system at the first site in some non-limiting examples executes program code, such as computer readable instructions stored on a non-transitory computer readable storage medium.

Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with aspects of the disclosure include, but are not limited to, mobile computing devices, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, gaming consoles, microprocessor-based systems, set top boxes, programmable consumer electronics, mobile telephones, mobile computing and/or communication devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. Such systems or devices may accept input from the user in any way, including from input devices such as a keyboard or pointing device, via gesture input, proximity input (such as by hovering), and/or via voice input.

Examples of the disclosure may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices in software, firmware, hardware, or a combination thereof. The computer-executable instructions may be organized into one or more computer-executable components or modules. Generally, program modules include, but are not limited to, routines, programs, objects, components, and data structures that perform particular tasks or implement particular abstract data types. Aspects of the disclosure may be implemented with any number and organization of such components or modules. For example, aspects of the disclosure are not limited to the specific computer-executable instructions or the specific components or modules illustrated in the figures and described herein. Other examples of the disclosure may include different computer-executable instructions or components having more or less functionality than illustrated and described herein.

In examples involving a general-purpose computer, aspects of the disclosure transform the general-purpose computer into a special-purpose computing device when configured to execute the instructions described herein.

The examples illustrated and described herein as well as examples not specifically described herein but within the scope of aspects of the disclosure constitute exemplary means for a coarse-grained scheduler. For example, the elements illustrated in FIG. 1, FIG. 2, FIG. 3, FIG. 4, and FIG. 5, such as when encoded to perform the operations illustrated in FIG. 6, FIG. 7, and FIG. 8, constitute exemplary means for receiving an identification of a virtual machine (VM) to be placed on a host in a plurality of hosts within a cluster; exemplary means for selecting a predetermined number of hosts from each queue in a plurality of queues to generate a candidate set of hosts; exemplary means for retrieving resource statistics for a set of VMs associated with the candidate set of hosts; exemplary means for analyzing the candidate set of hosts in accordance with a coarse-grained, resource-based optimization to select a host for the identified VM; and exemplary means for placing the identified VM on the selected host.

The examples illustrated and described herein as well as examples not specifically described herein but within the scope of aspects of the disclosure also constitute exemplary means for a fine-grained scheduler. For example, the elements illustrated in FIG. 1, FIG. 2, FIG. 3, FIG. 4, and FIG. 5, such as when encoded to perform the operations illustrated in FIG. 6, FIG. 7, and FIG. 8, constitute exemplary means for analyzing a set of communication graphs associated with the plurality of VMs to generate a set of penalty scores; exemplary means for selecting a set of VMs for relocation based on the set of penalty scores; and exemplary means for relocating a first VM in the set of VMs from a first host in the cluster to a second host in the cluster to minimize a distance between the first VM and a second VM in the plurality of VMs.

The order of execution or performance of the operations in examples of the disclosure illustrated and described herein is not essential, unless otherwise specified. That is, the operations may be performed in any order, unless otherwise specified, and examples of the disclosure may include additional or fewer operations than those disclosed herein. For example, it is contemplated that executing or performing a particular operation before, contemporaneously with, or after another operation is within the scope of aspects of the disclosure.

When introducing elements of aspects of the disclosure or the examples thereof, the articles “a,” “an,” “the,” and “said” are intended to mean that there are one or more of the elements. The terms “comprising,” “including,” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements. The term “exemplary” is intended to mean “an example of.” The phrase “one or more of the following: A, B, and C” means “at least one of A and/or at least one of B and/or at least one of C.”

Having described aspects of the disclosure in detail, it will be apparent that modifications and variations are possible without departing from the scope of aspects of the disclosure as defined in the appended claims. As various changes could be made in the above constructions, products, and methods without departing from the scope of aspects of the disclosure, it is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.

What is claimed is:
1. A method for resource scheduling, comprising: receiving an identification of a virtual computing instance to be placed on a host in a plurality of hosts within a cluster; selecting a number of hosts from each queue from a plurality of queues to generate a candidate set of hosts, wherein the queue comprises an ordered list of hosts from the plurality of hosts; retrieving resource statistics for a set of virtual computing instances associated with the candidate set of hosts and for the candidate set of hosts; analyzing the candidate set of hosts using a coarse-grained optimization, based on the retrieved resource statistics, to select the host for the identified virtual computing instance; placing the identified virtual computing instance on the selected host; analyzing a set of communication graphs associated with the set of virtual computing instances to generate a set of penalty scores based on a penalty function; selecting a pair of virtual computing instances for co-location based on the set of penalty scores; and relocating a first virtual computing instance in the pair of virtual computing instances from a first host in the cluster to a second host in the cluster to minimize a distance between the first virtual computing instance and a second virtual computing instance in the pair of virtual computing instances, wherein relocating the first virtual computing instance reduces at least one penalty score in the set of penalty scores associated with the pair of virtual computing instances.
2. The method of claim 1, wherein the identified virtual computing instance is a virtual machine (VM) and the method further comprises: updating host resource capacity to reflect network resources consumed by the identified VM.
3. The method of claim 1, wherein the pair of virtual computing instances is a first pair of VMs and the method further comprises: selecting a second pair of VMs for co-location based on the set of penalty scores; and relocating at least one VM in the second pair of VMs to a different host in the cluster to reduce a penalty score in the set of penalty scores associated with the at least one VM.
4. The method of claim 1, further comprising: tagging each virtual computing instance in a set of communicating virtual computing instances with a group tag, wherein the group tag indicates a virtual computing instance is to be placed on a same host or on a same rack as one or more other virtual computing instances in the set of communicating virtual computing instances.
5. The method of claim 1, further comprising: checking the identified virtual computing instance for a group tag prior to selecting the host for the identified virtual computing instance, wherein the group tag indicates the identified virtual computing instance is to be placed together on a same host or a same rack as another virtual computing instance having the same group tag.
6. The method of claim 1, further comprising: placing each host in the plurality of hosts into at least one queue in the plurality of queues in an order corresponding to at least one resource associated with the host.
7. A system, comprising: at least one processor associated with at least one server in a cluster; a fine-grained scheduler executable by the at least one processor, the fine-grained scheduler causing the at least one processor to at least: receive an identification of a virtual computing instance to be placed on a host in a plurality of hosts within a cluster; select a number of hosts from each queue from a plurality of queues to generate a candidate set of hosts, wherein the queue comprises an ordered list of hosts from the plurality of hosts; retrieve resource statistics for a set of virtual computing instances associated with the candidate set of hosts and for the candidate set of hosts; analyze the candidate set of hosts using a coarse-grained optimization, based on the retrieved resource statistics, to select the host for the identified virtual computing instance; place the identified virtual computing instance on the selected host; analyze a set of communication graphs associated with the set of virtual computing instances to generate a set of penalty scores based on a penalty function; select a pair of virtual computing instances for co-location based on the set of penalty scores; and relocate a first virtual computing instance in the pair of virtual computing instances from a first host in the cluster to a second host in the cluster to minimize a distance between the first virtual computing instance and a second virtual computing instance in the pair of virtual computing instances, wherein relocating the first virtual computing instance reduces at least one penalty score in the set of penalty scores associated with the pair of virtual computing instances.
8. The system of claim 7, wherein the identified virtual computing instance is a virtual machine (VM) and the fine-grained scheduler further causes the at least one processor to at least: update host resource capacity to reflect network resources consumed by the identified VM.
9. The system of claim 7, wherein the pair of virtual computing instances is a first pair of VMs and the fine-grained scheduler further causes the at least one processor to at least: select a second pair of VMs for co-location based on the set of penalty scores; and relocate at least one VM in the second pair of VMs to a different host in the cluster to reduce a penalty score in the set of penalty scores associated with the at least one VM.
10. The system of claim 7, wherein the fine-grained scheduler further causes the at least one processor to at least: tag each virtual computing instance in a set of communicating virtual computing instances with a group tag, wherein the group tag indicates a virtual computing instance is to be placed on a same host or on a same rack as one or more other virtual computing instances in the set of communicating virtual computing instances.
11. The system of claim 7, wherein the fine-grained scheduler further causes the at least one processor to at least: check the identified virtual computing instance for a group tag prior to selecting the host for the identified virtual computing instance, wherein the group tag indicates the identified virtual computing instance is to be placed together on a same host or a same rack as another virtual computing instance having the same group tag.
12. The system of claim 7, wherein the fine-grained scheduler further causes the at least one processor to at least: place each host in the plurality of hosts into at least one queue in the plurality of queues in an order corresponding to at least one resource associated with the host.
13. A non-transitory computer-readable medium embodying instructions executable by at least one processor, the instructions causing the at least one processor to at least: receive an identification of a virtual computing instance to be placed on a host in a plurality of hosts within a cluster; select a number of hosts from each queue from a plurality of queues to generate a candidate set of hosts, wherein the queue comprises an ordered list of hosts from the plurality of hosts; retrieve resource statistics for a set of virtual computing instances associated with the candidate set of hosts and for the candidate set of hosts; analyze the candidate set of hosts using a coarse-grained optimization, based on the retrieved resource statistics, to select the host for the identified virtual computing instance; place the identified virtual computing instance on the selected host; analyze a set of communication graphs associated with the set of virtual computing instances to generate a set of penalty scores based on a penalty function; select a pair of virtual computing instances for co-location based on the set of penalty scores; and relocate a first virtual computing instance in the pair of virtual computing instances from a first host in the cluster to a second host in the cluster to minimize a distance between the first virtual computing instance and a second virtual computing instance in the pair of virtual computing instances, wherein relocating the first virtual computing instance reduces at least one penalty score in the set of penalty scores associated with the pair of virtual computing instances.
14. The non-transitory computer-readable medium of claim 13, wherein the identified virtual computing instance is a virtual machine (VM) and the instructions further cause the at least one processor to at least: update host resource capacity to reflect network resources consumed by the identified VM.
15. The non-transitory computer-readable medium of claim 13, wherein the instructions further cause the at least one processor to at least: tag each virtual computing instance in a set of communicating virtual computing instances with a group tag, wherein the group tag indicates a virtual computing instance is to be placed on a same host or on a same rack as one or more other virtual computing instances in the set of communicating virtual computing instances.
16. The non-transitory computer-readable medium of claim 13, wherein the instructions further cause the at least one processor to at least: check the identified virtual computing instance for a group tag prior to selecting the host for the identified virtual computing instance, wherein the group tag indicates the identified virtual computing instance is to be placed together on a same host or a same rack as another virtual computing instance having the same group tag.
17. The non-transitory computer-readable medium of claim 13, wherein the instructions further cause the at least one processor to at least: place each host in the plurality of hosts into at least one queue in the plurality of queues in an order corresponding to at least one resource associated with the host.